Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful for making inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation existed until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model, TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
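To make the flow-assisted annotation idea concrete, here is a minimal sketch of how an annotated point can be carried between frames by a dense optical-flow field, so that an annotator only needs to correct the harder cases. The function name and the bilinear-interpolation details are illustrative assumptions, not the benchmark's actual annotation tool.

```python
import numpy as np

def propagate_point(point_xy, flow):
    """Carry an annotated point to the next frame using a dense flow field.

    `point_xy` is (x, y) in pixels; `flow` is an (H, W, 2) array of per-pixel
    (dx, dy) displacements between consecutive frames (e.g. from any
    off-the-shelf flow estimator). Bilinear interpolation keeps sub-pixel
    accuracy.
    """
    h, w, _ = flow.shape
    x, y = point_xy
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    interp = ((1 - ax) * (1 - ay) * flow[y0, x0]
              + ax * (1 - ay) * flow[y0, x1]
              + (1 - ax) * ay * flow[y1, x0]
              + ax * ay * flow[y1, x1])
    return np.array([x, y]) + interp

# Example: a synthetic flow field that shifts everything 2 px right, 1 px down.
flow = np.tile(np.array([2.0, 1.0]), (240, 320, 1))
print(propagate_point((100.5, 60.25), flow))  # -> approx [102.5, 61.25]
```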
For robots operating in the real world, it is desirable to learn reusable behaviours that can be transferred and adapted to many tasks and scenarios efficiently. We propose an approach for learning abstract motor skills from data using a hierarchical mixture latent variable model. In contrast to existing work, our method exploits a three-level hierarchy of both discrete and continuous latent variables to capture a set of high-level behaviours while allowing for variation in how they are executed. We demonstrate in manipulation domains that the method can effectively cluster offline data into distinct, executable behaviours while retaining the flexibility of a continuous latent variable model. The resulting skills can be transferred and fine-tuned on new tasks, unseen objects, and from state-based to vision-based policies, yielding better sample efficiency and asymptotic performance compared with existing skill- and imitation-based methods. We further analyse how and when the skills are most beneficial: they encourage directed exploration that covers large regions of the task-relevant state space, making them most effective in challenging sparse-reward settings.
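A minimal generative sketch of the kind of three-level hierarchy described above: a discrete skill index selects a high-level behaviour, a continuous per-skill latent captures variation in how it is executed, and a low-level mapping turns state and latents into an action. The distributions, dimensions and linear decoder below are illustrative assumptions, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_skills, z_dim, state_dim, action_dim = 4, 3, 5, 2

skill_prior = np.full(n_skills, 1.0 / n_skills)            # level 1: discrete skill
skill_means = rng.normal(size=(n_skills, z_dim))           # level 2: per-skill continuous latent
W = rng.normal(scale=0.1, size=(state_dim + z_dim, action_dim))  # level 3: low-level policy

def sample_action(state):
    k = rng.choice(n_skills, p=skill_prior)                # pick a behaviour
    z = rng.normal(loc=skill_means[k], scale=0.5)          # sample its execution style
    action = np.concatenate([state, z]) @ W                # decode to an action
    return k, z, action

print(sample_action(rng.normal(size=state_dim)))
```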
Complex sequential tasks in continuous-control settings often require an agent to successfully traverse a set of "narrow passages" in its state space. Solving such tasks with sparse rewards in a sample-efficient manner poses a challenge to modern reinforcement learning (RL), due to the long horizon of the problem and the lack of sufficient positive signal during learning. Various tools have been applied to address this challenge. When available, large sets of demonstrations can guide the agent's exploration. Hindsight relabelling, on the other hand, requires no additional source of information. However, existing relabelling strategies rely on task-agnostic goal distributions for exploration, which can make solving long-horizon tasks impractical. In this work, we extend hindsight relabelling to guide exploration along task-specific distributions implied by a small set of successful demonstrations. We evaluate the approach on four complex single- and dual-arm robotic manipulation tasks against strong, suitable baselines. The method requires far fewer demonstrations to solve all tasks and achieves significantly higher overall performance as task complexity increases. Finally, we study the robustness of the proposed solution with respect to the quality of the input representations and the number of demonstrations.
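To illustrate the relabelling idea, here is a minimal sketch: instead of drawing relabelled goals from a task-agnostic distribution (e.g. arbitrary future states of the agent's own trajectory), goals are biased toward states visited by a small set of successful demonstrations. The function name, the goal representation and the mixing probability are assumptions for illustration; the paper's actual scheme is more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

def relabel_goal(agent_traj, demo_states, p_demo=0.7):
    """Return a goal for hindsight relabelling of `agent_traj` (list of states)."""
    if rng.random() < p_demo:
        return demo_states[rng.integers(len(demo_states))]   # demo-guided goal
    return agent_traj[rng.integers(len(agent_traj))]         # standard future-state goal

demo_states = [rng.normal(size=3) for _ in range(50)]         # pooled demonstration states
agent_traj = [rng.normal(size=3) for _ in range(20)]
relabelled = [relabel_goal(agent_traj, demo_states) for _ in range(5)]
print(np.round(relabelled, 2))
```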
We present a novel image inversion framework and a training pipeline to achieve high-fidelity image inversion with high-quality attribute editing. Inverting real images into StyleGAN's latent space is an extensively studied problem, yet the trade-off between image reconstruction fidelity and image editing quality remains an open challenge. Low-rate latent spaces are limited in their expressive power for high-fidelity reconstruction. On the other hand, high-rate latent spaces degrade editing quality. In this work, to achieve high-fidelity inversion, we learn residual features in higher-rate latent codes that the lower-rate latent codes were not able to encode. This preserves image details in the reconstruction. To achieve high-quality editing, we learn how to transform the residual features so that they adapt to manipulations of the latent codes. We train the framework to extract residual features and transform them via a novel architecture pipeline and cycle consistency losses. We run extensive experiments and compare our method with state-of-the-art inversion methods. Quantitative metrics and visual comparisons show significant improvements. Code: https://github.com/hamzapehlivan/StyleRes
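The following toy sketch shows the shape of the training signal described above: residual features capture what the low-rate code misses, a transform adapts them to an edited code, and a cycle-consistency term ties the edited image back to the edited code. All modules here are tiny linear stand-ins with made-up dimensions, purely to make the data flow concrete; they are not the StyleRes architecture.

```python
import torch
import torch.nn as nn

latent_dim, feat_dim, img_dim = 64, 256, 3 * 32 * 32   # toy sizes

encoder      = nn.Linear(img_dim, latent_dim)                # image -> low-rate code w
generator    = nn.Linear(latent_dim + feat_dim, img_dim)     # (w, residual) -> image
residual_enc = nn.Linear(img_dim, feat_dim)                  # missed details -> residual features
transform    = nn.Linear(latent_dim + feat_dim, feat_dim)    # adapt residuals to an edited code

x = torch.randn(4, img_dim)                                  # batch of flattened images
w = encoder(x)                                               # low-rate inversion
coarse = generator(torch.cat([w, torch.zeros(4, feat_dim)], dim=1))
res = residual_enc(x - coarse)                               # details the low-rate code missed
recon = generator(torch.cat([w, res], dim=1))
rec_loss = ((recon - x) ** 2).mean()                         # high-fidelity reconstruction term

w_edit = w + 0.1 * torch.randn_like(w)                       # stand-in latent edit
res_edit = transform(torch.cat([w_edit, res], dim=1))        # residuals adapted to the edit
edited = generator(torch.cat([w_edit, res_edit], dim=1))
cycle_loss = ((encoder(edited) - w_edit) ** 2).mean()        # inverting the edit recovers w_edit
(rec_loss + cycle_loss).backward()
```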
Artificial Intelligence (AI) and its applications have sparked extraordinary interest in recent years. This achievement can be ascribed in part to advances in AI subfields including Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP). Deep learning, a sub-field of machine learning that employs artificial neural network concepts, has enabled the most rapid growth in these domains. As a result, the integration of vision and language has attracted a great deal of attention. The associated tasks have been designed so that they properly exemplify the concepts of deep learning. In this review paper, we provide a thorough and extensive review of state-of-the-art approaches and key model design principles, and discuss existing datasets, methods, their problem formulations, and evaluation measures for VQA and visual reasoning tasks, in order to understand vision-and-language representation learning. We also present some potential future directions in this field of research, with the hope that our study may generate new ideas and novel approaches to handle existing difficulties and develop new applications.
Emotion classification from EEG signals has seen considerable progress. However, issues such as lack of data and difficulty in learning important features and patterns remain areas with room for improvement in both computation and prediction accuracy. This work analyses the performance of baseline machine learning classifiers on the DEAP dataset, along with a tabular learning approach that provides comparable state-of-the-art results, leveraging the performance boost of its deep-learning architecture without the need to deploy heavy neural networks.
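As a rough illustration of the baseline comparison described above, the sketch below fits two standard classifiers on a feature matrix. The random data stands in for DEAP-derived per-channel band-power features and binarised labels, which is an assumption made purely so the example runs end to end.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1280, 160))          # stand-in for trial-level EEG features
y = rng.integers(0, 2, size=1280)         # stand-in for binarised emotion labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

for name, clf in [("SVM", SVC()), ("RandomForest", RandomForestClassifier())]:
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))
```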
Widely used deep learning models have been found to lack robustness: a little noise can fool state-of-the-art models into making incorrect predictions. Although there are many high-performing attack-generation methods, most of them add perturbations directly to the original data and measure them with L_p norms; this can destroy the main structure of the data and produce invalid attacks. In this paper, we propose a black-box attack that, instead of modifying the original data, modifies latent features of the data extracted by an autoencoder; we then measure the noise in the semantic space to preserve the semantics of the data. We trained autoencoders on the MNIST and CIFAR-10 datasets and found optimal adversarial perturbations using a genetic algorithm. Our method achieves a 100% attack success rate on the first 100 samples of the MNIST and CIFAR-10 datasets with smaller perturbations.
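Here is a minimal sketch of the attack loop described above: a genetic algorithm perturbs the latent code of an input (as produced by a pre-trained autoencoder), scoring candidates by how much they reduce the victim model's confidence in the true class while staying close to the original code in the latent ("semantic") space. The `encode`/`decode`/`victim_predict` callables, the fitness weighting and the toy stand-ins are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_ga_attack(z0, decode, victim_predict, true_label,
                     pop=32, gens=50, sigma=0.1, lam=1.0):
    population = z0 + sigma * rng.normal(size=(pop, z0.size))
    best, best_score = z0, -np.inf
    for _ in range(gens):
        # fitness: low confidence on the true class, small latent-space distance
        scores = np.array([
            -victim_predict(decode(z))[true_label] - lam * np.linalg.norm(z - z0)
            for z in population
        ])
        order = np.argsort(scores)[::-1]
        if scores[order[0]] > best_score:
            best, best_score = population[order[0]].copy(), scores[order[0]]
        parents = population[order[: pop // 4]]                 # keep the fittest quarter
        children = parents[rng.integers(len(parents), size=pop)]
        population = children + sigma * rng.normal(size=children.shape)  # mutate
    return best

# toy stand-ins so the sketch runs end to end
decode = lambda z: z                                            # identity "decoder"
victim_predict = lambda x: np.exp(x[:3]) / np.exp(x[:3]).sum()  # 3-class "classifier"
z_adv = latent_ga_attack(rng.normal(size=8), decode, victim_predict, true_label=0)
print(z_adv.shape)
```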
Deep learning is rapidly gaining adoption in healthcare to help improve patient outcomes. In medical image analysis, extensive training is required to acquire the expertise needed to become a trustworthy practitioner. However, although deep learning techniques continue to deliver state-of-the-art predictive performance, one of the major challenges impeding this progress in healthcare is the opaque nature of these models' reasoning mechanisms. Attribution is therefore crucial for building stakeholders' trust in the predictions that deep learning models make for clinical decision-making. This work seeks to answer the question: what do deep neural network models learn from medical images? To this end, we propose a novel attribution framework based on an adaptive path-based gradient integration technique. The results show that it improves domain experts' trust, and thereby healthcare outcomes, by allowing them to understand the input structures relevant to a prediction, discover new biomarkers, and reveal potential model biases.
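The sketch below shows the basic mechanics of path-based attribution in the integrated-gradients family, used here as a simplified stand-in for the adaptive path-based variant described above: accumulate the model's gradients along a path from a baseline image to the input and scale by the input-baseline difference. The `grad_fn` callable and the toy quadratic model are assumptions for illustration.

```python
import numpy as np

def path_attribution(x, baseline, grad_fn, steps=50):
    """Riemann approximation of integrated gradients along a straight-line path."""
    alphas = np.linspace(0.0, 1.0, steps)
    grads = np.mean([grad_fn(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * grads            # per-pixel attribution map

# toy "model" f(x) = sum(x**2), whose gradient is analytic: 2x
grad_fn = lambda x: 2.0 * x
x = np.random.default_rng(0).normal(size=(8, 8))
attr = path_attribution(x, np.zeros_like(x), grad_fn)
print(attr.shape)   # (8, 8); large values mark pixels the model relies on
```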
Drilling a hole on a curved surface is prone to failure when done manually, due to the difficulty of aligning the drill bit and the inherent instability of the task, potentially causing injury and fatigue to workers. On the other hand, fully automating such a task in real manufacturing settings can be impractical, since parts arriving at an assembly line can have a variety of complex shapes in which the drill-target locations are not easily accessible, making automated path planning difficult. In this work, an adaptive admittance controller with 6 degrees of freedom is developed and deployed on a KUKA LBR iiwa 7 cobot, enabling an operator to comfortably manoeuvre a robot-mounted drill with one hand and open holes on curved surfaces, aided by haptic guidance and visual guidance provided through an AR interface. Real-time adaptation of the admittance damping provides higher transparency when driving the robot in free space, while ensuring stability during drilling. Once the user brings the drill bit sufficiently close to the drill target and roughly aligns it with the desired drilling angle, the haptic guidance module first fine-tunes the alignment and then constrains the user's motion to the drilling axis only, so that the operator simply pushes the drill into the workpiece with minimal effort. Two sets of experiments were conducted to quantitatively investigate the potential benefits of the haptic guidance module (Experiment I) and to assess the practical value of the proposed pHRI system for real manufacturing settings based on the participants' subjective opinions (Experiment II).
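A minimal one-axis sketch of admittance control with adaptive damping, in the spirit of the controller described above: the commanded velocity follows M * dv/dt + D * v = F_ext, with the damping D lowered in free space (for transparency) and raised near the drilling phase (for stability). The gains, thresholds and integration step are illustrative values, not the parameters used on the KUKA LBR iiwa 7 in the paper.

```python
def admittance_step(v, f_ext, near_target, M=2.0, dt=0.005,
                    D_free=5.0, D_drill=60.0):
    """One control cycle of the admittance law M*dv/dt + D*v = f_ext."""
    D = D_drill if near_target else D_free
    dv = (f_ext - D * v) / M            # acceleration from the admittance law
    return v + dv * dt                  # commanded velocity for the next cycle

v = 0.0
for step in range(1000):
    f_hand = 10.0                       # constant 10 N push from the operator
    v = admittance_step(v, f_hand, near_target=step > 500)
print(round(v, 3))                      # settles near f_hand / D of the active phase
```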
The scanning pixel camera is a novel low-cost, low-power sensor that is not diffraction-limited. It produces data as a sequence of samples extracted from different parts of the scene over the course of a scan. It can provide very detailed images, at the cost of extensive sampling and slow image acquisition times. This paper proposes a new algorithm that allows the sensor to adapt the amount of sampling over the course of this sequence. This overcomes these limitations by minimising the bandwidth and time required to image and transmit a scene, while maintaining image quality. We examine applications to image classification and semantic segmentation, and are able to obtain results similar to those with fully sampled inputs while using 80% fewer samples.
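To make the adaptive-sampling idea concrete, here is a minimal 1-D sketch: while scanning, the sensor spends more samples where the signal acquired so far changes quickly and fewer where it is flat, reducing the total number of samples needed. The local-variation rule, strides and thresholds are assumptions for illustration, not the paper's sampling policy.

```python
import numpy as np

def adaptive_scan(readout, n_positions, base_stride=8, fine_stride=1, thresh=0.2):
    """Scan positions 0..n_positions-1, densifying where the signal varies strongly."""
    samples, pos, prev = {}, 0, None
    while pos < n_positions:
        value = readout(pos)                  # query the sensor at this position
        samples[pos] = value
        stride = fine_stride if prev is not None and abs(value - prev) > thresh else base_stride
        prev = value
        pos += stride
    return samples

# toy 1-D "scene": flat background with a sharp feature in the middle
scene = np.zeros(512)
scene[240:272] = 1.0
samples = adaptive_scan(lambda p: scene[p], len(scene))
print(f"{len(samples)} samples instead of {len(scene)}")
```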